[SOT][CUDAGraph] Add support for custom all-reduce operators under SOT mode #4386

DrRyanHuang · 2025-10-14T02:51:09Z

本 PR 的主要操作是将SOT warmup的过程延后，在CUDAGraph的Capture阶段进行warmup

本PR的前置PR

FastDeploy：[SOT][Cudagraph] Remove BreakGraph of #3302 && update CustomOp #3694
Paddlepaddle主仓库：[CUDAGraph] Remove CUDAGraph replay after capture and use the same device context in CUDA Graph Paddle#75954

这个PR终于跑通 SOT + CUDAGraph + 开启子图切分的整个流程，在这个PR中梳理一下整个过程：

单卡模型

以下是 @zyfncg 的前置PR，SOT下实现CudaGraph子图捕获功能，我们快速跑通了 ERNIE45T 0.3B✅

但是依旧存在显存拷贝的问题，如果输入的位置发生改变，则将新输入Copy到Capture的位置

        // https://github.com/PaddlePaddle/Paddle/blob/89f4bd92f49e15a9e1803a9e582526b2b8e4557d/paddle/fluid/framework/new_executor/instruction/cuda_graph_instruction.cc#L179-L187
        if (tensor->data() != input_tensors_.at(i).data()) {
          LOG(WARNING) << "The input [" << i << "] tensor addr for "
                       << "cuda graph is changed. Pay attention to this!";
          if (phi::is_gpu_place(tensor->place())) {
            const auto* dev_ctx =
                phi::DeviceContextPool::Instance().Get(place_);
            phi::Copy(*dev_ctx, *tensor, place_, false, &input_tensors_.at(i)); // <----- 这一行
          }
        }

为了解决这个问题，#3302 添加了 append_attention_with_output，在运行 append_attention 之前，优先创建一个 empty Tensor 作为 append_attention 的输出——在外部管理 append_attention 的显存

#3694 修复了 #3302 导致的打断，#4340(是 #3694 的一部分)移除了一些无用的输出，避免动静不统一导致的BUG

#3694 依赖 Paddle 主框架的两个PR:

自定义算子部分的改动: [CustomOp] Relax output count checks for inplace outputs Paddle#75086
消除lower过程中的 memcpy: [Dy2St][CUDAGraph] Set undefined place for CUDAGraph OP outputs before lowering to avoid unnecessary memcpy && Add CUDAGraph unitest Paddle#75078

到此为止，单卡的 ERNIE45T 21B和0.3B 都能跑通✅，且不存在 Copy，但是多卡运行会遇到CUDA700的问题

多卡模型

遇到到第一个CUDA700问题，不开CUDAGraph，只开SOT就能复现

用 cuda-gdb 分析这个CUDA700的问题，定位到是 Custom Allreduce 的问题，可暂时通过 --max-num-batched-tokens 2000 规避

与 @zhink 沟通后，可通过调大 Custom Allreduce 的 Buffer 大小来规避，即：

class CustomAllreduce:

    _SUPPORTED_WORLD_SIZES = [2, 4, 6, 8]

    # max_size: max supported allreduce size
-   def __init__(self, group: Group, max_size: int = 8192 * 1024) -> None:
+   def __init__(self, group: Group, max_size: int = 8192 * 1024 * 32 * 2) -> None:

        # This is a buffer for storing the tuples of pointers pointing to
        # IPC buffers from all ranks. Each registered tuple has size of
        # 8*world_size bytes where world_size is at most 8. Allocating 8MB
        # is enough for 131072 such tuples. The largest model I've seen only
        # needs less than 10000 of registered tuples.
-      self.rank_data = paddle.empty([8 * 1024 * 1024], dtype=paddle.uint8)
+      self.rank_data = paddle.empty([8192 * 1024 * 32 * 2], dtype=paddle.uint8)

（后面发现这个增大 Buffer 的策略似乎无效了）
此时SOT + Custom Allreduce 可以跑通推理流程 ✅

但开启 CUDAGraph 遇到 mp_allreduce_sum 对应的 DeviceContext 存在 cudagraph allocator 为空指针的问题
我把所有 Instruction 对应的 DeviceContext 指针打印了出来，统计了一下，共有1个CPUContext+2个GPUContext
出现次数分别是几十次、几千次和几万次，而动态图+CUDAGraph只有1个CPUContext+1个GPUContext

定位到是 phi::DeviceContext* ParseDeviceContext 这个函数将 GPUContext (有cudagraph allocator)转化成了另一个 GPUContext(无cudagraph allocator)，这里为了先跑通，就先直接 return origin_dev_ctx; 了，（后续PR: PaddlePaddle#75954）但依旧会有 CUDA700 的问题

中间其实也尝试了很多其他方法，和老代码 battle 了好久，遇到了多 CUDA Stream 的问题，CUDA90X之类的，这时，@zyfncg 说：

你先别管多流的问题，就按之前的跑，DeviceContext改对之后，暴露的新CUDA700的问题，才是目前需要解的

好，那就先解这个CUDA700，悲催的是，用之前的方法：cuda-gdb 分析，会出现连环 core dump 的问题，在生成 coredump 文件中，又报一个 CUDA700 🤦‍♂️

@zyfncg 又问了一个关键的问题：目前的问题是 Capture 阶段还是 Replay 阶段？我们把 Capture 过程中的 Replay 全都注释掉看看

  // paddle/fluid/framework/new_executor/instruction/cuda_graph_instruction.cc
  if (*cuda_graph_state_ref_ == 2 && cuda_graph_ == nullptr) {
	......
	// 以下是 Capture 阶段
    platform::BeginCUDAGraphCapture(
        place_, cudaStreamCaptureModeRelaxed, cuda_graph_capture_pool_id_);

    auto RecordTensorsForReplay = [&](const std::vector<Variable*>& vars) {
      ......
      return record_tensors;
    };

    // record the input tensors for replay
    input_tensors_ = RecordTensorsForReplay(input_vars_);
    interpreter_->Run({}, false);

    // record the output tensors for replay
    output_tensors_ = RecordTensorsForReplay(output_vars_);
    cuda_graph_ = platform::EndCUDAGraphCapture();
	
    // 以下是 Capture 阶段中的 Replay
    cuda_graph_->Replay(); // 这里注释就好了
  }

把 Capture 过程中的 Replay 全都注释掉后，服务就能启动了，这其实说明是Replay阶段报的错，Capture阶段没问题

此时忽然想到是不是还是CustomAllreduce导致的问题，于是加上 --disable-custom-all-reduce 跑一下，没想到直接跑通了
也就是 SOT+CUDAGraph+子图切分（禁用 CustomAllReduce）可以跑通整个流程✅

另外也可以确认，目前CUDA700的问题还是 CustomAllReduce 的问题，接下来就从动静统一的角度（动态图能跑通CUDAGraph，但SOT跑不通）来考虑差异在哪

最早怀疑的是，一些初始化操作（如下）静态图没有和动态图对齐

FastDeploy/fastdeploy/distributed/custom_all_reduce/custom_all_reduce.py

Lines 100 to 101 in 5abf597

    
           self._ptr = init_custom_all_reduce(self.meta_ptrs, self.rank_data, rank, self.full_nvlink) 
        
           register_buffer(self._ptr, self.buffer_ptrs)

但在SOT模式中，__init__ 函数还是以动态图跑的，没啥区别，也看了下函数实现，就是初始化 buffer + 创建一个 C++侧的CustomAllreduce 对象，并返回其指针到Python端

其次怀疑的是，if 分支这里的 should_custom_ar 函数：

FastDeploy/fastdeploy/distributed/communication.py

Lines 53 to 71 in 5abf597

    
           @paddle.jit.marker.unified 
        
           def tensor_model_parallel_all_reduce( 
        
               input_: paddle.Tensor, 
        
               group_: paddle.distributed.communication.group.Group = None, 
        
           ) -> paddle.Tensor: 
        
               """All-reduce the input tensor across model parallel group.""" 
        
               global _TP_AR 
        
               if _TP_AR is not None and _TP_AR.should_custom_ar(input_): 
        
                   # TODO: supports different_group custom allreduce 
        
                   _TP_AR.custom_all_reduce(input_) 
        
               elif paddle.in_dynamic_mode(): 
        
                   if group_ is not None: 
        
                       dist.all_reduce(input_, group=group_) 
        
                   else: 
        
                       hcg = fleet.get_hybrid_communicate_group() 
        
                       mp_group = hcg.get_model_parallel_group() 
        
                       dist.all_reduce(input_, group=mp_group) 
        
               else: 
        
                   dist.all_reduce(input_)

这个函数内选用 paddle.all_reduce 还是 custom_all_reduce，问题在这里吗？那先看看这个 should_custom_ar ：

FastDeploy/fastdeploy/distributed/custom_all_reduce/custom_all_reduce.py

Lines 134 to 145 in 5abf597

    
           def should_custom_ar(self, inp: paddle.Tensor): 
        
               if self.capturing: 
        
                   return True 
        
               inp_size = inp.shape[0] * inp.shape[1] * inp.element_size() 
        
               # custom allreduce requires input byte size to be multiples of 16 
        
               if inp_size % 16 != 0: 
        
                   return False 
        
               # for 4 or more non NVLink-capable GPUs, custom allreduce provides 
        
               # little performance improvement over NCCL. 
        
               if self.world_size == 2 or self.full_nvlink: 
        
                   return inp_size < self.max_size 
        
               return False

刚好这个函数的调用在 @paddle.jit.marker.unified 这个装饰器之下，内部可以直接print，不用担心打断

由于静态图中存在-1动态维度的情况，可以直接打印 inp.shape[0], inp.shape[1] 的值，只有 inp.shape[0] 存在为 -1 ，而inp.shape[1] * inp.element_size() 刚好可以被16整除，所以 inp_size % 16 == 0，无需担心这个if分支动静不统一

那 self.capturing 这个if 分支呢？之前没怎么注意过，在 custom_all_reduce 中，只有在 Capture 的时候才会走第一个 if self.capturing:分支。但是，之前 SOT Warmup 在 Capture Model 前面，也就是IR图在 CUDAGraph Capture 之前就已经确定了，走 else 分支，这就导致 replay 的时候也走 else 分支，这是不对的。

这里需要给SOT的执行过程也加上一个上下文管理器 capture_custom_allreduce 才能用这个 self.capturing，也就是如下

FastDeploy/fastdeploy/model_executor/graph_optimization/cudagraph_piecewise_backend.py

Lines 136 to 138 in 1e59905

    
           with capture_custom_allreduce(): 
        
               with self.cuda_graph_manager.run_impl_guard(): 
        
                   entry.runnable(**kwargs)

并在 capture_model 函数头上加 @sot_warmup_guard(True) 这个sot warmup的装饰器：
https://github.com/cattidea/FastDeploy/blob/1b9f351d219013f1a69db355736ee33bbc035866/fastdeploy/worker/gpu_model_runner.py#L1672-L1673

也就是这个 PR 中的内容，也就是 SOT+CUDAGraph+子图切分（不禁用 CustomAllReduce）整个流程可以跑通✅

遗留问题

多卡不开CUDAGraph，只开SOT会有这个问题：

W1016 15:31:22.741484 87300 pir_interpreter.cc:2040] pd_op.weight_only_linear raises an exception std::runtime_error: 
[fpA_intB Runner] Failed to run cutlass fpA_intB gemm. Error: Error Internal
terminate called after throwing an instance of 'std::exception'
  what():  std::exception

启动参数加上 --max-num-batched-tokens 4000 就好了，但改成 8000 会有问题，动态图则没有这个限制，还是 CustomAllReduce 导致的

cc @SigureMo @zyfncg

paddle-bot · 2025-10-14T02:52:32Z

Thanks for your contribution!

gongshaotian · 2025-10-16T01:46:21Z

fastdeploy/model_executor/graph_optimization/cudagraph_piecewise_backend.py

            entry.captured = True
-            with self.cuda_graph_manager.run_impl_guard():
-                entry.runnable(**kwargs)
+            with capture_custom_allreduce():


静态图也用custom all reduce 对吧

对，目前静态图 Custom AllReduce 和 Paddle AllReduce 都支持

但是使用Custom AllReduce，需要加参数 --max-num-batched-tokens 500，后面这个数小于500就行，具体原因后面继续排查~

gongshaotian · 2025-10-16T01:51:22Z

fastdeploy/worker/gpu_model_runner.py

                self.proposer.update_task_chunk_prefill(task)
            task.chunk_idx += 1

+    @sot_warmup_guard(True)


SOT的 Warm Up 延后是为了避免 custom all reduce 的什么问题呢

在 custom_all_reduce 中，只有在 Capture 的时候才会走第一个 if self.capturing:分支
但是，之前 SOT Warmup 在 Capture Model 前面，也就是IR图在 CUDAGraph Capture 之前就已经确定了，走 else 分支，这就导致 replay 的时候也走 else 分支，这是不对的

FastDeploy/fastdeploy/distributed/custom_all_reduce/custom_all_reduce.py

Lines 208 to 222 in 5abf597

def custom_all_reduce(self, input: paddle.Tensor) -> Optional[paddle.Tensor]:

"""The main allreduce API that provides support for cuda graph."""

if self.capturing:

lib = cuda_wrapper.CudaRTLibrary()

stream = paddle.device.current_stream()

stream_capturing = lib.cudaStreamIsCapturing(stream)

if stream_capturing.value == 1:

# 1 is cudaStreamCaptureStatusActive: The stream is capturing.

return self.all_reduce(input, input, registered=True)

else:

# If warm up, mimic the allocation pattern since custom

# allreduce is out-of-place.

return paddle.empty_like(input)

else:

return self.all_reduce(input, input, registered=False)

gongshaotian

LGTM

zhink · 2025-10-16T08:14:57Z

Custom Allreduce 可以都增加到8192 * 1024 * 32 * 2

adapt custom all reduce for sot

2f3134c

Merge branch 'develop' into adapt_custom_all_reduce4sot

60c5e7b

gongshaotian reviewed Oct 16, 2025

View reviewed changes

gongshaotian approved these changes Oct 16, 2025

View reviewed changes

Merge branch 'develop' into adapt_custom_all_reduce4sot

3b4a9fd

gongshaotian merged commit b87e2c6 into PaddlePaddle:develop Oct 16, 2025
24 of 27 checks passed

DrRyanHuang deleted the adapt_custom_all_reduce4sot branch October 16, 2025 11:31

DrRyanHuang changed the title ~~Add support for custom all-reduce operators under SOT mode~~ [SOT][CUDAGraph] Add support for custom all-reduce operators under SOT mode Oct 16, 2025

DrRyanHuang mentioned this pull request Oct 20, 2025

[CUDAGraph] Remove CUDAGraph replay after capture and use the same device context in CUDA Graph PaddlePaddle/Paddle#75954

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[SOT][CUDAGraph] Add support for custom all-reduce operators under SOT mode #4386

[SOT][CUDAGraph] Add support for custom all-reduce operators under SOT mode #4386

Uh oh!

DrRyanHuang commented Oct 14, 2025 •

edited

Loading

Uh oh!

paddle-bot bot commented Oct 14, 2025

Uh oh!

gongshaotian Oct 16, 2025

Uh oh!

DrRyanHuang Oct 16, 2025 •

edited

Loading

Uh oh!

gongshaotian Oct 16, 2025

Uh oh!

DrRyanHuang Oct 16, 2025

Uh oh!

gongshaotian left a comment

Uh oh!

zhink commented Oct 16, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

	self._ptr = init_custom_all_reduce(self.meta_ptrs, self.rank_data, rank, self.full_nvlink)
	register_buffer(self._ptr, self.buffer_ptrs)

	@paddle.jit.marker.unified
	def tensor_model_parallel_all_reduce(
	input_: paddle.Tensor,
	group_: paddle.distributed.communication.group.Group = None,
	) -> paddle.Tensor:
	"""All-reduce the input tensor across model parallel group."""
	global _TP_AR
	if _TP_AR is not None and _TP_AR.should_custom_ar(input_):
	# TODO: supports different_group custom allreduce
	_TP_AR.custom_all_reduce(input_)
	elif paddle.in_dynamic_mode():
	if group_ is not None:
	dist.all_reduce(input_, group=group_)
	else:
	hcg = fleet.get_hybrid_communicate_group()
	mp_group = hcg.get_model_parallel_group()
	dist.all_reduce(input_, group=mp_group)
	else:
	dist.all_reduce(input_)

	def should_custom_ar(self, inp: paddle.Tensor):
	if self.capturing:
	return True
	inp_size = inp.shape[0] * inp.shape[1] * inp.element_size()
	# custom allreduce requires input byte size to be multiples of 16
	if inp_size % 16 != 0:
	return False
	# for 4 or more non NVLink-capable GPUs, custom allreduce provides
	# little performance improvement over NCCL.
	if self.world_size == 2 or self.full_nvlink:
	return inp_size < self.max_size
	return False

	with capture_custom_allreduce():
	with self.cuda_graph_manager.run_impl_guard():
	entry.runnable(**kwargs)

	def custom_all_reduce(self, input: paddle.Tensor) -> Optional[paddle.Tensor]:
	"""The main allreduce API that provides support for cuda graph."""
	if self.capturing:
	lib = cuda_wrapper.CudaRTLibrary()
	stream = paddle.device.current_stream()
	stream_capturing = lib.cudaStreamIsCapturing(stream)
	if stream_capturing.value == 1:
	# 1 is cudaStreamCaptureStatusActive: The stream is capturing.
	return self.all_reduce(input, input, registered=True)
	else:
	# If warm up, mimic the allocation pattern since custom
	# allreduce is out-of-place.
	return paddle.empty_like(input)
	else:
	return self.all_reduce(input, input, registered=False)

[SOT][CUDAGraph] Add support for custom all-reduce operators under SOT mode #4386

[SOT][CUDAGraph] Add support for custom all-reduce operators under SOT mode #4386

Uh oh!

Conversation

DrRyanHuang commented Oct 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

单卡模型

多卡模型

遗留问题

Uh oh!

paddle-bot bot commented Oct 14, 2025

Uh oh!

gongshaotian Oct 16, 2025

Choose a reason for hiding this comment

Uh oh!

DrRyanHuang Oct 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gongshaotian Oct 16, 2025

Choose a reason for hiding this comment

Uh oh!

DrRyanHuang Oct 16, 2025

Choose a reason for hiding this comment

Uh oh!

gongshaotian left a comment

Choose a reason for hiding this comment

Uh oh!

zhink commented Oct 16, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

DrRyanHuang commented Oct 14, 2025 •

edited

Loading

DrRyanHuang Oct 16, 2025 •

edited

Loading